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ABSTRACT 

We  present  a  linear-time  algorithm  for  the  weighted  median  problem,  which 
arises  as  a  subproblem  in  certain  multivariate  optimization  problems,  including 
linear  Li  approximittior.  The  algorithm,  which  is  stated  recursively,  is  based  on 
the  well  known  lioear-time  algorithm  for  the  standard  median  problem.  Compu- 
tational experience  with  the  linear  medi?.n  algorithms  for  both  the  unweighted 
and  weighted  cases  is  descriLjod. 
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1.  Introduction 

The  weighted  median  problem  b  the  following:  given  n  unordered  real  numbers  Xi,  •  •  •  ,2, 
and  associated  positive  real  weights  Wi,  ■  ■  ■  ,w^,  solve: 

Tmnf{x)=tw,\x-x,\.  (1) 

If  the  weights  are  all  one,  the  pioblem  reduces  to  that  of  finding  a  median  of  n  numbers,  for 
which  a  linear-time  algorithm  is  weli  known  (see  ilj,[5)). 

The  problem  (1)  arises  as  a  subproblem  -  namely  tie  line  search  -  in  methods  for  solving  the 
multidimensional  linear  Lj  probleni  (  j2j,  [3],  [4]): 

miu  f]  I  aJx-0,  \ ,  where  ^  t  R"" ,  /3,  f  R, 
and  a  method  [7]  for  minimizing  a  sum  of  Euclidean  norms: 

min  ^  I  IA/7-*,  I  I2,  where  A,  t  R"^',  b^  t  R' . 

These  line  searches  are  generally  implemented  by  a  partial  sorting  technique,  using  either  heap- 
sort  or  quicksort.  The  problem  (1)  is  easily  solved  by  such  a  method.  To  see  this  let  us  Erst 
derive  optimality  conditions.  It  is  easily  seen  that  the  left  and  right  derivatives  of  f  are  respec- 
tively given  by: 

/-'(')=  E  ^.-  E  ^, 

//(x)=  E  "'.-  E  "'.• 
1:1,  >  I    1:1,  <i 

Since  f  is  convex  and  piecewise  linear,  a  necessary  and  sufficient  condition  for  x  to  solve  (1)  is  : 

/_'(j)<Oand/+'(j)>0. 
or  equivalently, 

wtsumi<wtsum2+  wtsum^ 

wtsumi+  wt8um2^tutsumi 


•  This  work  was  snpported  in  pirt  by  National  Science  Foundation  Grant  MCS-81-01924  and  in  part  by  the 
Office  of  Naval  Kesearch  Gradaate  Fellowship  Program. 
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where 

wtsumi=    V]   w,,  u!tsum2=    5]   "'i.  and  li'/sumj  =    J]  w,  (2) 

Clearly,  a  solution  will  always  be  given  by  one  of  the  points  r.,  although  there  may  in  some  cases 
be  a  continuum  of  solutions  in  the  interval  between  two  of  the  points. 

An  algorithm  based  on  partial  heapsorting  therefore  is  described  as  follows: 

1.  Let  8=0,  t='S2w,. 

19=1 

2.  Form  the  {r, }  in  a  heap. 

3.  While  »</  do 

extract  the  smallest  of  the  remaining  {z, } 
from  the  heap,  say  Xj,  and  readjis'.  the  heap. 

3*-3+  Wj-,   t*-t-Wj. 

It  is  clear  that  the  algorithm  terminates  with  j:,   an  optin-.al  oolutioa.  Step  2  of  the  algorithm 
requires  0(n)  operations  and  Step  3  requires  O(nlogn)  (5ee  [l]). 

The  purpose  of  this  paper  is  (a)  to  describ*  a  '.eneralization  of  the  well  known  linear-time 
median  algorithm  to  solve  (1)  in  linear  tuut.  'td  (b)  .o  J<  j>ib,  cciuputational  experience  with 
the  "fast"  linear  algorithms  for  both  tUc  m^^,^^  aud  «.<ight^u  median  problems.  We  investigate 
several  different  forms  of  implemeniitiun  aad  «,aipi;imU,  v.a!<.ijr.inc  how  large  n  must  be  to  make 
the  algorithm  practical. 

2.  A  Linear-Time  Algorithm  for  "«de  Vi  eiguie<i  ProDltm 

A  linear-time  algorithm  for  the  unweighted  median  prookm  is  described  recursively  as  fol- 
lows (see  [l]).  The  function  is  called  initially  with  ar^umentj  j— I,  {^i,  •  •  •  ,x„},  n. 

function  SELECT(t,S,n) 

(*  find  the  kth  smallest  element  of  the  n  elements  in  the  set  S  *) 
if  n<64  then  use  a  partial  heapsort  to  find  the  kth  clement 
else  begin 

divide  S  into  1-^1  strips  of  5  elements  each 


t  strip 

P  be  the  set  of  the  med    „ 

If 


sort  each  5-e!ement  strip 

let  MEDL\NSTRIP  be  the  set  of  the  medians  of  the  5-element  strips 


mm^  SELECT(I-^  ,  MEDLVNSTRIP, 

let  5,, 52,5s  be  the  sets  of  elements  of  S 

less  than,  equal  to  and  greater  than  mm,  respectively,  and 
let  Hi, Mo  and  n^  be  the  sizes  of  these  sets. 
if  n,>Jk  then  return(SELECT(it,5i,rj,)) 
else  if  n,H-  nz>k  then  return( mm) 

else  retum(SELECT(*-n,-n2,5s,n3)) 
end 

The  key  to  generalizing  this  algorithm  to  solve  the  weighted  median  problem  is  to  write  a 
second  procedure  WTSELECT,  with  a  similar  structure,  except  that  the  first  procedure  call 
within  WTSELECT  b  to  SELECT  and  the  second  call,  in  the  if  statement,  is  to  WTSELECT 


itself.  The  fact  that  the  first  call  is  to  the  unweighted  median  algorithm  ensures  retaining  the 
even  split  which  gives  the  linear-time  performance.  The  second  call  to  VVTSELECT  guarantees 
that  the  algorithm  finds  the  weighted  solution.  A  weight  parameter,  balance,  is  passed  in  this 
recursive  call.  Specifically,  the  weighted  median  is  found  by  calling  the  following  with  arguments 
{xi,  ■  ■  ■  ,Zn},  {wi,  ■  ■■  ,Wn},0,  n: 

function  WTSELECT(X,W,6a/ance,n) 

(*  find  I,  such  that  balance+  wl3umi<wt8um2+  wtsuTUi 

and  balanced  wt8umi+  wtstim^^ivtsum^ 

where  wt8um,i  =  1,2,3,  are  defined  by  (2).  *) 
if  n  <64  then  use  a  partial  heapsort  to  find  the  weighted  median 

else  begin 


W 


divide  X  into  j— j  strips  of  5  elements  each 

sort  each  5-element  strip 

let  MEDIANSTRIP  be  the  set  of  the  medians  of  the  5-element  strips 


mm^SELECT(|-^  ,  MEDIANSTRIP, 

let  $  ,?2,'^  ■■*)€  ±^,sete  of  elements  of  X 

less  than,:  c  ]ijal  to  aac":  eater  than  mm,  respectively,  and 
let  ni^,.^i.^  :  Hs  be  th?;  sizes  of  these  sets. 

let  Vi,  V2,  Vs  bt-  the  sets  of  weights  associated 
with  eac'j  of  the  pointo  in  each  set,  and  wtsumi,wt8umn  and  wlsum^ 
be  the  sums  of  the  weights  of  the  points  in  each  of  the  sets. 

if  wtaumi+  batance>wt8um2+  wtsum^  then 

TetuTn(WTSELECT(S  i,Vi,batance-wt3um2-wt8umi,ni)) 
else  if  wt3umi+  wt8um2+  balattce>wtsum3  then  return(mm) 

else  tet\iTii('WTSELECT(Ss,Vi,balance+  wtsumi+  wiaumo.nj)) 

end 

The  correctness  of  the  algorithm  is  easily  verified,  and  the  argument  that  it  runs  in  0(n) 
time  follows  the  standard  one  for  the  unweighted  median  algorithm  [l].  Note  that  VVTSELECT 
reduces  to  SELECT  when  all  the  {w,}  are  equal  to  one,  with  balance  playing  the  role  of  r!-2jfc 
when  the  initial  set  size  is  even,  and  n+  1-2*  when  the  initial  set  size  is  odd. 

Space  requirements  for  the  fast  median  algorithms  are  considerably  higher  than  for  the 
O(nlogn)  heapsort-variant  algorithms.  Procedure  WTSELECT  requires  the  following  data  struc- 
tures, with  space  requirements  in  parentheses: 

X(n),  W(n),  Si(^),  VA,  sA  vA  MEDIANSTRIP(U  ) 


Note  that  Sj  and  5s  need  be  only  3/4  the  size  of  the  original  set,  since  the  algorithm  guarantees 
that  at  most  3/4  of  the  points  are  less  than  the  median  of  the  medianstrip  and  at  most  3/4  of  the 
points  are  greater  than  the  median  of  the  medianstrip.  Moreover,  virtually  no  space  ncd  be  allo- 
cated to  S2,  since  all  the  elements  of  S2  are  the  same,  namely  mm.  The  only  information  that 
must  be  stored  is  the  number  of  elements  in  5:  and  the  sum  of  the  associated  weights. 

In  addition,  procedure  WTSELECT  invokes  the  fast  unweighted  median  routine  SELECT 
which  has  the  following  space  requirements: 

S(m),  S,{^),  5s(-^),  MEDIANSTRIP(I^I) 


W 


Of  course,  when  SELECT  is  called  by  WTSELECT,  m  = 


W 


3.  Experimental  Results  from  Implementing  Linear-Time  Median  and  Weighted 
Median  Algorithms. 

The  linearity  of  the  median  algorithms  depends  on  their  implementation  in  a  way  which 
minimizes  the  number  of  comparisons.  A  merge-insertion  sort  was  used  to  sort  the  individual 
6ve-element  strips  by  using  only  seven  comparisons.  This  sort  is  optimal  in  terms  of  number  of 
comparisons,  since  it  achieves  the  information-theoretic  lower  bound  of  flog"!!  comparisons  (see 
[6]).  Inserting  the  median  of  the  medianstrip  into  each  of  the  individual  five-element  strips  can  be 
done  with  two  comparisons  by  using  binary  insertion. 

In  order  to  improve  the  running  time  of  these  procedures,  we  eliminated  the  recursion  by 
explicitly  using  a  stack.  Procedure  WTSELECT  is  only  tail-recursive,  i.e.,  the  recursive  call  to 
WTSELECT  is  the  last  instruction  in  the  body  of  the  p.»-ocedure;  SELECT,  the  regular  median 
routine,  has,  however,  both  tail  recursion  and  a  recursive  call  to  find  the  median  of  the  median- 
strip.  Eliminating  tail  recursion  is  trivial.  All  that  needs  to  be  done  is  to  reinitialize  the  parame- 
ters to  the  values  of  the  new  call  and  transfer  control  to  the  st?jt  of  the  procedure.  Eliminating  a 
recursive  call  from  the  middle  of  a  procedure  is  more  complicated  since  upon  return  from  the 
recursive  call  the  procedure  must  resume  comruirti.i  ■hh  ':■-..  o  Irjnal  parameters  and  local  vari- 
ables. In  general,  the  recursive  call  will  :iMr%e  tht^  vaJv?s  oi  ihe  variables  it  accesses.  Therefore, 
the  current  values  of  the  parameters  ■^"i  ■  \  local  vaii?b'es  ,nust  be  saved  on  a  stack.  Then  the 
recursive  call  can  be  initiated  by  re.  aliz  .•?  paramfrts^s  r nd  transferring  control  to  the  start  of 
the  procedure.  In  this  way,  upon  re*i  kn  from  a  fccunivc  call  the  values  of  the  parameters  and 
local  variables  at  a  previous  level  of  lecussion  can  be  retrie  -ed  bv  popping  the  stack  and  the  cal- 
ling environment  can  be  restored. 

The  computational  experiments  reported  below  were  porformed  using  the  VMS  Fortran 
compiler  with  optimization  on  a  VAX  machine  ac  the  Courant  Mathematics  and  Computing 
Laboratory.  Figure  1  gives  the  CPU  time,  in  seconds,  for  calls  to  the  hespsort-variant  procedure 
and  to  WTSELECT  for  n<5000.  The  results  were  disappointing.  WTSELECT  certainly  did  not 
seem  to  be  linear  and  was  consistently  slower  than  the  heapsort  method. 


n 

heapsort 

WTSELECT 

100 

.02 

.03 

500 

.13 

.18 

750 

.19 

oo 

1000 

.27 

.33 

.62 

.75 

3000 

1.00 

1.20 

4000 

1.37 

1.65 

5000 

1.96 

2.28 

Figure  1 

Since  the  linear  weighted  median  algorithm  is  a  modification  of  the  linear  unweighted 
median  algorithm,  we  compared  the  CPU  time  for  calls  to  SELECT  and  the  heapsort  routines  for 
the  unweighted  median  (Figure  2).  The  "fast"  regular  median  routine  also  did  not  appear  to  be 
linear,  and  in  fact,  compared  even  less  favorably  to  the  heapsort  method. 
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n 

heapsort 

SELECT 

100 

.02 

.02 

500 

.07 

.13 

750 

.14 

.29 

1000 

.19 

.41 

2000 

.38 

1.14 

3000 

.61 

1.65 

4000 

.86 

2.30 

5000 

1.06 

2.95 

Figure  2 

We  then  tried  various  methods  of  speeding  up  the  "fxst"  median  algorithms.  The  fast  algo- 
rithms reduce  the  problem  of  finding  the  median  of  the  original  set  to  that  of  finding  the  median 
among  the  elements  of  either  5,  or  S^  (the  subsets  of  the  original  set  consisting  of  those  elements 
less  than  or  greater  than  the  median  of  the  medianstrip).  In  order  to  simulate  the  recursive  call 
to  find  the  median  of  S^  or  Sj,  the  values  of  that  subset  must  be  copied  into  the  memory  space 
used  for  the  input  parameter  before  control  can  be  transferred  to  the  start  of  the  procedure. 
Since  the  copying  of  these  values  ;s  time^onsuming,  we  considered  using  a  two-dimensional  array 
to  store  the  original  set,  S/aiid  \^j,  eac„/as  one  Vow ^ of  the  two-dimensional  structure.  The  array 
has  three  rows,  each  of  length  n.  One  row  is  usei"  for  the  initial  set,  one  for  Sj  and  one  for  S^. 
When  the  procedure  is  called  recursively,  the  inde  of  he  row  containing  the  initial  set  is  inter- 
changed with  the  index  of  the  row  containing  th'^  subocK  in  which  the  median  is  to  be  found. 
This  necessitates  allocating  more  Vpace  thaa  is  uecessai-y  for  S^  and  5j  but  reduces  the  copying  of 
Si  or  5j  into  another  location  at  the  time  of  a  recursive  call  to  a  simple  switch  of  indices.  This 
strategy  did  accelerate  the  relative  time  of  the  fast  algorithm  (i.e.,  the  ratio  of  CPU  times  of  the 
"fast"  algorithm  to  heapsort)  but,  overall,  both  proceeded  at  a  slower  rate,  possibly  due  to  the 
increase  in  storage  requirements  (Figure  3).  This  modification  was,  therefore,  not  incorporated  in 
later  versions  of  the  procedure. 


n 

heapsort 

modified  SELECT 

100 

.03 

.03 

500 

.15 

.14 

750 

.23 

.30 

1000 

.32 

.52 

2000 

.68 

1.23 

3000 

1.07 

2.01 

4000 

1.46 

2.89 

5000 

1.87 

3.62 

Figure  3 


The  next  modification  that  was  tried  was  to  divide  the  set  into  strips  of  seven  elements 
rather  than  strips  of  five  elements.  Again,  a  merge-insertion  sort  was  used  to  sort  the  seven- 
element  strips  in  13  comparisons  (which  achieves  the  lower  bound  of  flognll  ,  see  (6|  ).  In  this 
case,  the  recursive  problem  is  only  one  seventh  the  size  of  the  original  problem.  This  method  did 
indeed  prove  to  be  faster  than  the  original  method  of  using  five-element  strips  (Figure  4). 


SELECT 

SELECT 

n 

heapsort 

7-e!ement  strips 

5-eIement  strips 

100 

.02 

.03 

.02 

500 

.09 

.17 

,13 

750 

.12 

.27 

.29 

1000 

.17 

.45 

.41 

2000 

.41 

1.10 

1.14 

3000 

.61 

1.61 

1.65 

4000 

.39 

2.28 

2.30 

5000 

1.08 

2.86 

2.95 

Figure  4 

At  this  point  we  decided  to  experiment  with  larger  sets  of  points  and  see  whether  for  still 
larger  n  the  situation  would  improve.  We  increased  n  tc  30,000  and  found  that  for  n>!==2S000 
the  "fast"  unweighted  median  algorithm  was  actuaiijf  faster  thau  the  heapsort-variant  algorithm. 
For  the  weighted  case,  the  results  were  much  better.  For  n>:=:i6000  the  "fast"  algorithm  proved 
faster  than  the  heapsort  method  (Figure  5). 


n 

unweighted  nudiai- 

weighted  median 

heansort 

SELECT 

heapsort                    WTSELECT 

1000 

.21 

.51 

.25           :                    .25 

5000 

1.09 

2.81 

2.04 

1.95 

10000 

2.49 

6.43 

10.49 

4.25 

150C0 

6.10 

10.45 

21.54 

6.73 

20000 

11.08 

M.37 

35.06 

9.15 

25000 

17.48 

18.88 

40.19 

12.05 

26000 

18.49 

20.  ai 

52.16 

12.70 

27000 

19.87 

20.50 

54.53 

1309 

28000 

20.93 

20.95 

55.97 

13.76 

29000 

22.40 

22.20 

1:01.21 

14.37 

30000 

23.51 

23.15 

1:01.98 

14.93 

Figure  5 


There  is  a  striking  difference  between  the  ratios  of  the  CPU  time  for  the  "fast"  algorithm  to 
the  CPU  time  for  the  heapsort-variant  algorithm  in  the  weighted  median  and  regular. median 
cases.  There  seem  to  be  two  factors  that  contribute  to  this  discrepancy:  (1)  The  standard 
heapsort-variant  algorithm  for  the  weighted  case  actually  does  more  work  than  the  standard 
heapsort-variant  algorithm  for  the  unweighted  case.  (There  is  an  extra  0(n)  loop  jnst  to  calculate 
the  sum  of  the  weights.)  However,  this  should  be  balanced  by  the  extra  work  done  in 
WTSELECT  over  that  done  by  SELECT.  (2)  As  mentioned  above,  the  structure  of  the  linear 
weighted  median  algorithm  is  different  than  that  of  the  linear  regular  median  algorithm  in  that  all 
recursive  calls  are  tail-recursive.  As  previously  explained,  simulating  a  tail-recursive  call  does  not 
involve  manipulating  a  stack.  Therefore,  procedure  WTSELECT  itself  does  not  push  or  pop  ele- 
ments on  a  stack,  although  it  does  call  procedure  SELECT  to  find  the  median  of  the  medianstrip. 
Procedure  SELECT  does  contain  an  embedded  recursive  call  to  itself  and  therefore  docs  make  use 
of  a  stack.  However,  when  SELECT  is  called  from  WTSELECT,  the  problem  size  is  1/7  the  size 
of  the  original  problem.  Therefore,  the  number  of  elements  to  be  stacked  by  the  weighted  median 
routine  is  a  fraction  of  the  number  of  elements  that  are  stacked  by  the  unweighted  median  rou- 
tine, for  the  same  initial  set. 


With  regard  to  the  second  consideration,  we  implemented  a  new  version  of  the  fast  regular 
median  algorithm.  This  new  version  has  a  structure  isomorphic  to  that  of  WTSELECT.  The 
new  version  of  SELECT  is  only  tail-recursive,  with  the  recursive  call  to  calculate  the  median  of 
the  medians  being  replaced  by  a  call  to  a  procedure  "SELECT2"  which  is  essentially  a  copy  of 
the  original  procedure  SELECT  (i.e.,  it  contains  two  recursive  calla  to  itself).  As  expected,  this 
did  make  the  linear  regular  median  routine  faster.  The  new  routine  proved  faster  than  the 
heapsort-variant  routine  for  n>c=!l3,000.  For  n=30,000  the  CPU  time  for  SELECT  was 
reduced  to  9.38  seconds. 

We  then  tested  whether  further  refinements  would  further  accelerate  the  running  time  of  the 
algorithm.  We  let  SELECT2  itself  be  only  tail-recursive,  calling  SELECTS  instead  of  using  a 
recursive  call  to  itself  to  find  the  median  of  the  medianstrip.  Again,  this  did  result  in  further 
speed-ups.  For  n>=23000  the  fast  routine  proved  faster.  For  n  =30,000  the  CPU  time  of 
SELECT  was  reduced  to  7.45  seconds. 

Taking  this  one  step  further,  v.e  found  that  if  we  let  SELECT3  call  SELECT4,  the  fa.st 
algorithm  becomes  faster  than  the  heapsort  algorithm  for  ««2000.  At  30,000  the  CPU  time  was 
6.78.   For  set  sizes  of  30,000  one  cannot  go  any  further,  since  the  rc.ur'iive  caJl  in  SELECT4  is  a 

call  on  a  problem  the  size  of  the  original  set,  which  would  mean  a  set  size  of  12,  at  v/hich 

2401 
point  there  are  no  further  recursive  levels"    The  recursive  call  to  find  the  median  of  the  median- 
strip  involves  solving  a  subproblerrf  ot  1/7  the  sfze  of  the  original  problem.  For  n<64  the  heap- 
sort  method  is  known  to  require  feweir  comparisons  than  the  fast  method  and  is  therefore  used  to 
solve  thej)roblem.  Thus,  the  maximam  number  of  recursive  levels  invoked  for  an  initial  set  of 


Of  course,  these  refinements  involve  a  time/space  tradeoff.  These  modified  algorithms 
involve  maintaining  a  few  almost  identical  copies  of  SELECT  in  one  program.  The  slight  reduc- 
tion in  time  at  the  expense  of  considerable  increase  in  space  requirements  is  of  questionable  value. 
Obviously,  using  one  of  these  variations  of  SELECT  for  the  call  in  the  weighted  median  routine 
would  further  improve  the  time  efficiency  of  WTSELECT. 

Figure  6  compares  the  actual  number  of  comparisons  done  by  the  median  routines.  It  is 
empirically  clear  that  the  number  of  comparisons  done  by  the  fast  algorithms  is  linear  in  n. 


n 

unweighted  median 

weighted  median        | 

heap!=ort 

SELECT 

heapsort 

WTSELECT 

100 

702 

404 

771 

758 

1000 

10208 

5691 

11349 

4986 

5000 

63454 

27761 

GQ097 

38353 

lOCOO 

137286 

56841 

147359 

78124 

15000 

214008 

85464 

227543 

119236 

20000 

20-1378 

113300 

313347 

160217 

25000 

376130 

143837 

399795 

199309 

30000 

457862 

170912 

487901 

241311 

Figure  6 


In  conclusion,  it  appears  that  for  small  set  sizes  the  heapsort-variant  median  finding  algo- 
rithms are  faster  than  the  linear-time  median  routines.  However,  for  large  n,  when  the  set  con- 
tains many  thousand  data  points,  the  fast  median  algorithms  perform  better  than  the  standard 
heapsort  methods  and  for  very  large  n  the  time  savings  can  be  considerable. 
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